
A side project to find out which position is the most popular among the fans and which position has the biggest chunk of market values are also included in this report. This side project is just for fun.
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
if pd.__version__ != "0.23.3":
print("Your pandas is not the version 0.23.3, there are cells you might be unable to execute due to different version")
if sns.__version__ != "0.9.0":
print("Your seaborn is not the version 0.9.0, there are cells you might be unable to execute due to different version")
%matplotlib inline
matplotlib.style.use('ggplot')
from pylab import rcParams
sns.set(style="ticks")
rcParams['figure.dpi'] = 350
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['patch.edgecolor'] = 'white'
rcParams['font.family'] = 'StixGeneral'
rcParams['figure.figsize'] = 10,8
rcParams['font.size'] = 20
# rcParams['axes.labelsize'] = 'large'
rcParams['xtick.labelsize'] = 20
rcParams['ytick.labelsize'] = 20
premier_league_df = pd.read_csv("epldata_final.csv")
pl_1718_table_df = pd.read_excel("premier_league_17_18_table.xlsx", index_col="Pos")
Since two datasets have different column names, first step is to make the naming convention consistent.
We will follow two rules for clarity:
pl_1718_table_df.columns = ["club", "played", "won", "drawn", "lost", "goals", "goals against", "goal difference", "points"]
pl_1718_table_df.head(3)
Firstly, drop the redundant columns: "club id" and "age category"
premier_league_df.drop(["age_cat", "club_id"], axis=1, inplace=True)
Afterwards we will remove all the underscores
premier_league_df.columns = ["name", "club","age", "position",
"position category", "market value(Mill)",
"wiki views(daily)", "fpl value", "fpl select(%)",
"fpl points", "region category", "nationality",
"new foreign", "big club", "new signing"]
premier_league_df.head(3)
As this project is for exploratory data analysis only, meaningful string will be more intuitively than numeric code
# First to check if anyother category exist
premier_league_df["position category"].unique()
def change_position_cat_to_string(x):
if type(x) == str:
return x
switcher = {
1: "attackers",
2: "midfielders",
3: "defenders",
4: "goalkeepers"
}
return switcher.get(x, "Invalid category")
premier_league_df["position category"] = premier_league_df["position category"].apply(change_position_cat_to_string)
premier_league_df.head(3)
premier_league_df["region category"].unique()
We can see that there are invalid value inside "region category" column. Next we want to locate those value can change them to consitent value.
# TODO: Ask Teo, seems chain assignment doesn't work
idx = premier_league_df[premier_league_df["region category"].isnull()].index
premier_league_df[premier_league_df["region category"].isnull()]
_ = premier_league_df.set_value(index=idx, col="region category", value= 4)
def change_region_cat_to_string(x):
if type(x) == str:
return x
switcher = {
1: "England",
2: "EU",
3: "Americas",
4: "Rest of World"
}
return switcher.get(x, "Invalid category")
premier_league_df["region category"] = premier_league_df["region category"].apply(change_region_cat_to_string)
premier_league_df.head(3)
premier_league_df["big club"] = premier_league_df["big club"].apply(lambda x: True if x == 1 else False)
premier_league_df["new foreign"] = premier_league_df["new foreign"].apply(lambda x: True if x == 1 else False)
premier_league_df["new signing"] = premier_league_df["new signing"].apply(lambda x: True if x == 1 else False)
premier_league_df.head(3)
The original data set column "new foreign" doesn't count as new signing which doesn't make sense.
premier_league_df["new signing"] = (premier_league_df["new signing"] == True) | (premier_league_df["new foreign"] == True)
Since the data come from two dataset, the club names aren't consistent.
pl_1718_table_df.head(3)
club_name_alphabet_s = pl_1718_table_df["club"].sort_values().reset_index(drop=True)
def change_clubs_name(x):
idx=0
for name in premier_league_df["club"].unique():
if x != name:
idx += 1
else:
break
if x != club_name_alphabet_s[idx]:
x = club_name_alphabet_s[idx]
return x
premier_league_df["club"] = premier_league_df["club"].apply(change_clubs_name)
premier_league_df["club"].unique()
Now the club names are follow the same convention: full name with space as separator.
club_mean_df = premier_league_df.groupby("club").mean().drop(["fpl value", "fpl points"], axis=1).sort_values("market value(Mill)", ascending=False)
club_mean_df.head(3)
Firstly let's see the big picture of the dataset before we get closer.
# Reset index to run on the labs machine which has lower version pandas
club_mean_merge_df = pd.merge(club_mean_df.reset_index(), pl_1718_table_df, on="club").drop(["played"], axis=1)
club_mean_merge_df.head(3)
Change the column names to more intuitive names
club_mean_merge_df.columns = ["club", "avg age", "avg market value(Mill)", "avg wiki views(daily)", "new foreign portion", "big club", "new signing portion", "won", "drawn", "lost", "goals", "goals against", "goal difference", "points"]
axs = pd.plotting.scatter_matrix(club_mean_merge_df[["avg age", "avg market value(Mill)", "avg wiki views(daily)", "new signing portion", "goal difference", "points"]], diagonal="hist")
n = 6
for x in range(n):
for y in range(n):
# To get the axis of subplots
ax = axs[x, y]
# To make x axis labels horizontal
ax.xaxis.label.set_rotation(90)
# To make y axis labels horizontal
ax.yaxis.label.set_rotation(0)
# To make sure y labels are outside the plot
ax.yaxis.labelpad = 55
plt.suptitle("Scatter matrix among the average key features")
Note: Goal difference is the difference between the goals and the goal against a club has
Let's take a look at the correlations to help understanding the scatter matrices plot more numerically. The red colour outstanding the strong correlations greater than 0.8 or less than -0.8.
def color_strong_corr_red(val):
"""
Takes a scalar and returns a string with
the css property `'color: red'` for strong correlations
, black otherwise.
"""
color = 'red' if (val > 0.8 or val < -0.8) else 'black'
return 'color: %s' % color
club_mean_merge_df.corr().style.applymap(color_strong_corr_red)
A obvious way to know which club has done a good job is through the points in standing table, which the greatest points club wins the champion.
Through the big picture observation we gain from the last section, we know there is a correlation between market values a club has and the points it gets at the end of the season. However, how much is the effect of market values a club has towards the ranking they achieve needs more investigations.
sns.set_style("whitegrid")
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20
rcParams['axes.facecolor'] = 'white'
sorted_market_value_idx = premier_league_df.groupby(["club"])["market value(Mill)"].mean().sort_values(ascending=False).index
# Keep the rank
pl_1718_table_df["rank"] = pl_1718_table_df.index
# Reorder the table by market valur
rank_se = pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx)["rank"]
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
# plot_beautify(ax1)
# plot_beautify(ax2)
sns.boxplot(x="market value(Mill)", y="club", data=premier_league_df, order=sorted_market_value_idx, ax=ax1, )
sns.barplot(x="points", y="club", data=pl_1718_table_df, ax=ax2, order=sorted_market_value_idx)
for p, rank in zip(ax2.patches, rank_se):
width = p.get_width()
ax2.text(width + 3,
p.get_y() + p.get_height()/2.0,
rank,
ha="center"
)
ax2.set_ylabel("")
ax1.set_title("Market Values for Each Club")
ax2.set_title("Points and Rank Index")
f.tight_layout()
# TODO: Remove ax2 x axis number and add the points to the top of bar
As we sorted the club orders by the club mean market values, the points a club gained at the end of season is not strictly fit for the order. Chelsea have the second largest mean market value but only end up with fifth in the standings, even cannot join the UEFA Champions League next year. Same case happened on Southampton, with the second largest mean market value among the clubs that are not "the big 6" but end up with almost relegation standing.
sns.jointplot(premier_league_df.groupby(["club"])["market value(Mill)"].mean().sort_values(ascending=False), pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx)["points"], kind="reg")
plt.subplots_adjust(top=0.9)
plt.suptitle("Avg market value per club with the points it got")
From the join plot we can have two observations:
If we plot it in kde mode then we can clearly see two clusters in the chart.
sns.jointplot(premier_league_df.groupby(["club"])["market value(Mill)"].mean().sort_values(ascending=False), pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx)["points"], kind="kde")
plt.subplots_adjust(top=0.9)
plt.suptitle("Avg market value per club with the points it got")
Separate the premier league with two groups according to the plotting above. One is the "big clubs", the other one is the rest of the league.
big_6_index = pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx).index[:6]
sns.jointplot(premier_league_df[premier_league_df["big club"] == True].groupby(["club"])["market value(Mill)"].sum().sort_values(ascending=False), pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx).loc[big_6_index, "points"], kind="reg")
plt.subplots_adjust(top=0.9)
plt.suptitle("Big clubs")
other_clubs_index = pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx).index[6:]
sns.jointplot(premier_league_df[premier_league_df["big club"] == False].groupby(["club"])["market value(Mill)"].sum().sort_values(ascending=False), pl_1718_table_df.set_index(keys="club").reindex(sorted_market_value_idx).loc[other_clubs_index, "points"], kind="reg")
plt.subplots_adjust(top=0.9)
plt.suptitle("Small clubs")
After the separation, the linear regression function becomes less confident. especially for the clubs that are not "big clubs", the relationship among them cannot described by a linear function.
New signing can represent the ambition of a club for new season. Even the portion of new signing doesn't affect the ranking of a club much, but we can check if the market values associated with new signing player a better estimation.
new_signing_df = premier_league_df[(premier_league_df["new signing"] == True) | (premier_league_df["new foreign"] == True)].groupby("club").mean().sort_values(by="market value(Mill)", ascending=False)[["market value(Mill)", "new signing", "age", "wiki views(daily)"]]
sorted_new_signing_index = new_signing_df.index
# Reorder the table by market valur
rank_se = pl_1718_table_df.set_index(keys="club").reindex(sorted_new_signing_index)["rank"]
new_signing_merge_df = pd.merge(new_signing_df.reset_index(), pl_1718_table_df, on="club").drop(["new signing", "played"], axis=1)
Firstly plot the multivariate data to gain a big picture of the relationship between different features.
g = sns.PairGrid(new_signing_merge_df[["market value(Mill)", "age", "wiki views(daily)", "points", "goal difference"]])
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=6)
plt.subplots_adjust(top=0.9)
plt.suptitle("New Signing Players club average")
From the above plot we can see couple of interesting facts:
Note: Goal difference is the difference between the goals and the goal against a club has
new_signing_merge_df.corr().style.applymap(color_strong_corr_red)
sns.jointplot(new_signing_df["market value(Mill)"], pl_1718_table_df.set_index(keys="club").reindex(sorted_new_signing_index)["points"], kind="reg")
# plt.title("The relation betweent new assigning investment and points")
plt.subplots_adjust(top=0.9)
plt.suptitle("New Signing Players")
Compared to the all players average market value a club has join plot with the points it gained, the new signing player average market value show a better linear relationship.
sns.catplot(x="age", y="club", data=premier_league_df[premier_league_df["big club"] == True], hue="position category", height=6.5)
plt.subplots_adjust(top=0.9)
plt.suptitle("Big clubs age distribution over position")
attacker_average_df = premier_league_df[premier_league_df["position category"] == "attackers"].groupby("club").mean().sort_values("age")
attacker_average_merge_df = pd.merge(attacker_average_df, pl_1718_table_df, on="club").drop(["fpl value", "fpl points", "new foreign", "new signing", "played"], axis=1)
It is obviously that the age will not be the only factor that affects goals. A new factor that combine the age and the market values should be introduced.
attacker_average_merge_df["value/age"] = attacker_average_merge_df["market value(Mill)"]/ np.square(attacker_average_merge_df["age"])
def color_very_strong_corr_red(val):
"""
Takes a scalar and returns a string with
the css property `'color: red'` for strong correlations
, black otherwise.
"""
color = 'red' if (val > 0.9 or val < -0.9) else 'black'
return 'color: %s' % color
attacker_average_merge_df[["club", "age", "value/age", "market value(Mill)", "big club", "goals", "points"]].corr().style.applymap(color_very_strong_corr_red)
From the correlation table, we can see that, as we expected, the combination of value and age feature has a strong relationship to goals. Even more surprisingly we see a even stronger correlation with the points!
sns.jointplot(attacker_average_merge_df["value/age"], attacker_average_merge_df["points"], kind="reg")
plt.subplots_adjust(top=0.9)
plt.suptitle("All clubs")
sns.lmplot(x="value/age", y="points", data=attacker_average_merge_df, hue="club", fit_reg=False)
plt.subplots_adjust(top=0.9)
plt.suptitle("All clubs")
Is the market value normally distributed among all the position in the premier league. Which position is the most popular one and which position is dying?

position_mean_order_idx = premier_league_df.groupby("position").mean().sort_values("market value(Mill)", ascending=False).index
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
sns.boxenplot(x="market value(Mill)", y="position", data=premier_league_df, ax=ax1, order=position_mean_order_idx)
sns.boxenplot(x="wiki views(daily)", y="position", data=premier_league_df, ax=ax2, order=position_mean_order_idx)
ax2.set_ylabel("")
ax1.set_title("Market Values for Each Position")
ax2.set_title("Popularity")
f.tight_layout()
left midfield(LM) is the dying position in the 17/18 premier league. The players in that position has a surprising low market value median and the even sadder wikipedia views.Second striker(SS) is the favorite of the football fans. The daily wikipedia views for the player at that position is outstanding from the others.Attacking midfield(AM) the most valuable position in the premier league. The big chunk of players have arrange from 15 millions euro to 35 millions euro market values.Let's see who is the favorite second striker(SS) in premier league
premier_league_df[premier_league_df["position"] == "SS"].sort_values("wiki views(daily)", ascending=False).head(1)
It is the record goalscorer for the England national team and for Manchester United, Wayne Rooney.

g = sns.FacetGrid(premier_league_df, col="position category", height=4)
g.map(plt.hist, "market value(Mill)")
g = sns.FacetGrid(premier_league_df, col="position category", height=4)
g.map(plt.hist, "wiki views(daily)")
Attackers are the most valuable and popular position category in the premier league. On the contrary, the goal keepers are barely under the spotlight.
premier_league_df[premier_league_df["position category"] == "goalkeepers"].sort_values("market value(Mill)", ascending=False).head(3)
The best two goalkeepers in premier league only has 40 millions euro value at 17/18 season.